Skip to content

Add DType::Union and UnionVariants#7890

Closed
connortsui20 wants to merge 1 commit into
developfrom
ct/add-union
Closed

Add DType::Union and UnionVariants#7890
connortsui20 wants to merge 1 commit into
developfrom
ct/add-union

Conversation

@connortsui20
Copy link
Copy Markdown
Contributor

@connortsui20 connortsui20 commented May 11, 2026

Summary

Tracking issue: #7882

Adds the scaffolding and boilerplate for the new logical DType::Union type.

This is the first PR of probably a bunch, so it simply adds the DType::Union variant and the UnionVariants type that DType::Union holds. Other than that, it is mostly just proto and flatbuffers code and other boilerplate that is easy to implement.

A lot of the code is essentially duplicated from implementations of Struct. It could have been nice if we could reuse StructFields, but we need to additionally store a type IDs field that maps the different variant types to tag indexes (i.e. 0, 1, 2, 5), so we might as well have a new type for this that is named more appropriately.

Also note that the tags do not have to be consecutive (to support potential schema evolution, and also it does make things like casts easier).

I've also enforced that variant names are distinct. The Arrow spec does not make any guarantee about this, but I've chosen the sane path that DuckDB also chooses.

Finally, I reordered some of the dtype match statements so that struct and union are always next to each other after list and fsl. This did cause a bit of extra churn... but with so many types now I think it is important that similar things are next to each other.

Edit: There's 1000+ lines of flatbuffers diff because I think my local flatc install is different from the one on CI?

API Changes

Adds Union variant to DType

Testing

Basic testing (mostly for the serialization roundtrips).

@connortsui20 connortsui20 added the changelog/feature A new feature label May 11, 2026
@connortsui20 connortsui20 marked this pull request as ready for review May 11, 2026 21:16
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 11, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

✅ 1210 untouched benchmarks


Comparing ct/add-union (bd308d7) with develop (3e93048)

Open in CodSpeed

@connortsui20 connortsui20 force-pushed the ct/add-union branch 3 times, most recently from 4e8802c to 4e86fc3 Compare May 11, 2026 21:32
Comment on lines +303 to +305
DType::Union(variants, _) => variants
.variants()
.all(|variant| supports_uncompressed_size_in_bytes(&variant)),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to remove this until the actual uncompressed_size_in_bytes is implemented later?

Comment thread fuzz/src/array/mod.rs Outdated
Comment on lines +191 to 194
// TODO(connor): This seems wrong...
DType::FixedSizeList(_, list_size, _) => value.as_list().len() == *list_size as usize,
// TODO(connor): This also seems wrong...
DType::Struct(struct_fields, _) => value.as_list().len() == struct_fields.nfields(),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@connortsui20
Copy link
Copy Markdown
Contributor Author

I think my local version of flatc is wrong? Or rather its just different from CI

❯ flatc --version
flatc version 25.12.19

@connortsui20
Copy link
Copy Markdown
Contributor Author

#7899

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Comment on lines +28 to +32
/// Per Arrow's spec, the per-row type tag is an `int8`. By default, tag `i` selects the child at
/// offset `i` (`type_ids = [0, 1, ..., N-1]`). Schemas may also use non-consecutive tags (e.g.
/// `[0, 5, 7]`), in which case the value of `type_ids[i]` is the tag used in the data to select the
/// child at offset `i`. Supporting non-consecutive tags from v1 lets the schema remove children
/// without renumbering the remaining tags.
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why i8 not u8?

Its worth explaining why we have type_ids and non-consecutive tags too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow uses i8 and the maximum number of variants is 128 so this is fine imo

Comment on lines +34 to +37
/// Variant names must be distinct. Unlike [`StructFields`](crate::dtype::StructFields), which
/// permits duplicate field names for Arrow/Parquet round-trip fidelity, duplicates have no
/// meaningful semantics in a sum type and are rejected at construction.
///
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we cannot round trip arrow?

If you use index access not name then it does make sense to have duplicate fields. Fields are just metadata in arrow.

/// // Accessing a field by name will yield the first
/// assert_eq!(fields.field("int_col").unwrap(), DType::Primitive(PType::I32, Nullability::Nullable));
/// ```
#[allow(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expect please

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expect doesnt work here (for some reason, I have no idea)

@connortsui20
Copy link
Copy Markdown
Contributor Author

@joseph-isaacs wants to split this out, so I will remove UnionVariants from this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants